SoundCloud Analysis

Author

Hajara Muzammal

Introduction

Over the past several decades, the consumption and discovery of music have undergone a dramatic transformation, shifting from physical media and radio programming toward digital streaming platforms powered by algorithms. Modern listeners rarely depend on CDs, radio countdowns, or store recommendations to find new music. Instead, they increasingly rely on automated systems such as curated playlists, algorithmic feeds, and social recommendations. Major platforms like Spotify leverage machine learning models to anticipate listener taste and predict song preferences, while community-based platforms like SoundCloud rely heavily on reposts and follower-driven exposure. Together, these features decentralize distribution, making pathways to musical success less linear and less controlled by industry gatekeepers than in previous eras.

In this emerging landscape, playlists play a particularly influential role. Playlists serve as curated gateways through which music is discovered, circulated, and eventually popularized. Many believe that being featured in a widely followed playlist directly contributes to a song’s rise in popularity. However, there is a gap between public perception and data-based evidence on whether playlist inclusion realistically drives visibility and streaming success — or whether popularity is largely shaped by external social dynamics such as viral sharing, artist branding, and digital marketing investment.

Motivated by this uncertainty, this report analyzes nearly 15,000 tracks from a SoundCloud-linked dataset hosted on Hugging Face, which combines playlist metadata, Spotify-style audio features, track popularity scores, and direct streaming links. This provides a rare opportunity to explore streaming metrics from both algorithmic and creator-driven perspectives. The guiding research question is therefore: Is track popularity associated with playlist inclusion? A supporting objective considers whether measurable audio characteristics contribute to popularity outcomes. By applying exploratory visuals and statistical methods, this analysis aims to determine whether songs that appear in more playlists or songs with certain musical traits tend to rise to greater levels of popularity — or whether these assumptions lack empirical support.

Data Ingest

We use a publicly available dataset hosted on Hugging Face, which contains playlist metadata, song characteristics, and direct SoundCloud links.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

url <- "https://huggingface.co/datasets/Zuru7/Spotify_Songs_with_SoundCloud_links/resolve/main/song_df_normalised.csv"
SONGS_raw <- read_csv(url, show_col_types = FALSE)

# Standardize names (works even if you re-run the doc)
SONGS <- SONGS_raw %>%
  rename(
    track          = any_of(c("track", "track_name")),
    artist         = any_of(c("artist", "track_artist")),
    album          = any_of(c("album", "track_album_name")),
    popularity     = any_of(c("popularity", "track_popularity")),
    playlist_genre = any_of(c("genre", "playlist_genre")),
    playlist_subgenre = any_of(c("subgenre", "playlist_subgenre")),
    soundcloud_link = any_of(c("soundcloud_link", "links"))
  ) %>%
  filter(!is.na(track), !is.na(artist), !is.na(popularity))
glimpse(SONGS)
Rows: 14,987
Columns: 23
$ track             <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ artist            <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ lyrics            <chr> "the trees, are singing in the wind the sky blue, on…
$ album             <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ danceability      <dbl> 0.2166860, 0.8447277, 0.3580533, 0.7462341, 0.440324…
$ energy            <dbl> 0.8779620, 0.6460897, 0.3674362, 0.8850809, 0.632868…
$ key               <dbl> 0.81818182, 0.54545455, 0.45454545, 0.81818182, 0.54…
$ loudness          <dbl> 0.7817377, 0.6813893, 0.7425419, 0.8813965, 0.730275…
$ mode              <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1…
$ speechiness       <dbl> 0.02434122, 0.21616793, 0.01306387, 0.02065654, 0.03…
$ acousticness      <dbl> 0.011792960, 0.004353434, 0.694556021, 0.037297028, …
$ instrumentalness  <dbl> 0.010205339, 0.007422998, 0.000000000, 0.000000000, …
$ liveness          <dbl> 0.34221195, 0.48613476, 0.05781237, 0.13038190, 0.08…
$ valence           <dbl> 0.4080748, 0.6565622, 0.4090849, 0.2424166, 0.308073…
$ tempo             <dbl> 0.5545093, 0.4227024, 0.4605076, 0.5250801, 0.625378…
$ language          <chr> "en", "en", "en", "en", "en", "en", "en", "en", "es"…
$ sentiment         <chr> "Positive", "Positive", "Positive", "Negative", "Pos…
$ song_artist       <chr> "i feel alive steady rollin", "poison bell biv devoe…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

Data ingest refers to the process of bringing raw data into an analytical environment so that it can be examined, manipulated, and prepared for statistical modeling. In this project, ingestion involved importing a publicly accessible dataset from Hugging Face — a leading repository for datasets commonly used in machine learning, artificial intelligence, and data science. The dataset selected was appropriate because it merged two distinct domains of listener behavior: Spotify-style musical feature scoring and SoundCloud link-based exposure, thus offering a hybrid perspective that reflects both algorithmic recommendation and grassroots community engagement.

Within the R environment, the dataset was read with a CSV import function and stored as a data frame. Each row represents one track paired with a SoundCloud link, while the columns capture metadata and analytical properties of interest. Variables include track name, artist, album title, external SoundCloud link, popularity score, playlist name, playlist genre, playlist subgenre, and a set of Spotify-derived audio features such as danceability, tempo, energy, and valence. Judging by the glimpse() output and the file name (song_df_normalised.csv), these features appear to be stored on a normalised 0 to 1 scale. After rows with a missing track, artist, or popularity value were filtered out, the table consists of 14,987 observations and 23 variables, a sample large enough to support exploratory visualization and statistical inference.

The decision to use this dataset was also driven by its alignment with the project’s research goals. Because the study examines whether playlist exposure influences streaming success, it is essential that the dataset provides both playlist metadata and popularity metrics. Additionally, the inclusion of musical features enables a secondary layer of inquiry: whether popularity is driven by intrinsic qualities of a song — such as mood or rhythm — or whether it is driven externally by exposure mechanisms such as playlist placement.

Furthermore, the ingest stage required validating that the dataset could be processed efficiently without extensive transformation. Basic checks ensured that character fields were readable, numeric values were correctly parsed, and variable types were appropriate for analyses such as correlations, aggregations, and regression. With ingest complete, the dataset was successfully loaded and prepared for subsequent cleaning and restructuring steps.
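The basic checks described above are not shown in the report, so the following is only a minimal sketch of what they might look like. It runs on a small made-up data frame (`songs_demo`) rather than the real table, and `check_columns()` is a hypothetical helper, not part of any package. The assumed valid ranges (0 to 1 for normalised audio features, 0 to 100 for popularity) are assumptions based on the values visible in the output.

```r
# Toy stand-in for the ingested table; all values are made up for illustration.
songs_demo <- data.frame(
  track        = c("song a", "song b", "song c"),
  artist       = c("artist x", "artist y", "artist z"),
  popularity   = c(28, 0, 65),
  danceability = c(0.22, 0.84, 0.75),
  energy       = c(0.88, 0.65, 0.89),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report the type and number of missing values per column.
check_columns <- function(df) {
  data.frame(
    column    = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    row.names = NULL
  )
}

col_report <- check_columns(songs_demo)
print(col_report)

# Assumed ranges: normalised audio features in [0, 1], popularity in [0, 100].
features_in_range <- all(vapply(
  songs_demo[c("danceability", "energy")],
  function(x) all(x >= 0 & x <= 1, na.rm = TRUE),
  logical(1)
))
popularity_in_range <- all(
  songs_demo$popularity >= 0 & songs_demo$popularity <= 100
)
```

Applied to the real data, the same helper could be called as `check_columns(SONGS)` to confirm that text fields were parsed as character and numeric fields as numeric before any correlation or regression is run.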

Data Cleaning

Now let's clean the data.

PLAYLIST_TABLE <- SONGS %>%
  transmute(
    playlist_name   = playlist_name,
    artist_name     = artist,
    track_name      = track,
    album_name      = album,
    popularity      = popularity,
    playlist_genre  = playlist_genre,
    playlist_subgenre = playlist_subgenre,
    soundcloud_link = soundcloud_link
  )

glimpse(PLAYLIST_TABLE)
Rows: 14,987
Columns: 8
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ artist_name       <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ track_name        <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ album_name        <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

After the initial ingestion step, the dataset underwent a structured cleaning process designed to improve interpretability and prepare the data for exploration. Because the goal of this project is to understand whether playlist presence is associated with popularity, the cleaning stage prioritized variables most directly related to playlist membership and measurable popularity outcomes. The ingested table contained 23 variables, 15 of which are not needed for the playlist question: the lyrics text, the eleven audio features, the language and sentiment labels, and the concatenated song_artist key. Setting these aside yields a more focused table for the playlist analysis, while the audio features remain available in SONGS for the feature-level analysis later in the report.

The cleaned table ultimately retained 14,987 observations and 8 key variables, including playlist name, artist name, track name, album name, popularity score, playlist genre, playlist subgenre, and the corresponding SoundCloud link. These retained columns represent the minimum required information needed to investigate playlist placement, provide contextual metadata for songs, and link the data to an external platform where tracks may be experienced. Selecting only the variables that align directly with streaming dynamics ensures that subsequent analyses remain interpretable and aligned to the central research question.

Additionally, tracks with missing popularity values were removed (via the filter() call at the end of the ingest step, which also drops rows lacking a track or artist name) to ensure consistency across statistical outputs. Popularity is the dependent variable in this study, so missing values would undermine the ability to compare songs fairly across playlists or across musical attributes. Removing incomplete records, rather than imputing values and introducing artificial assumptions, ensures that results are derived entirely from observed data.

A key part of cleaning also included preparing the data for playlist-level aggregation. The raw dataset records playlist membership one row at a time, so a track that belonged to several playlists would occupy several rows. To evaluate playlist exposure as a single measurable feature, an aggregated track-level table was created that counts the number of playlists in which each unique track-artist combination appears. This restructuring is intended to make playlist exposure quantifiable and comparable across songs.

This streamlined dataset, containing only the most analytically relevant fields and complete popularity values, provides a balanced combination of playlist context, track metadata, and platform linkage. It serves as the foundation for the exploratory visualizations and statistical models that examine how playlist inclusion and musical characteristics relate to song popularity.

Data Exploration

We define a “popular song” as one with a popularity that is greater than or equal to 70.

pop_threshold <- 70
pop_threshold
[1] 70
track_counts <- PLAYLIST_TABLE %>%
  distinct(playlist_name, track_name, artist_name, popularity) %>%
  count(track_name, artist_name, popularity, name = "playlist_appearances")

glimpse(track_counts)
Rows: 14,987
Columns: 4
$ track_name           <chr> "$20 fine", "$ave dat money (feat. fetty wap & ri…
$ artist_name          <chr> "jimi hendrix", "lil dicky", "max frost", "queen"…
$ popularity           <dbl> 44, 69, 43, 60, 0, 39, 83, 75, 50, 48, 55, 68, 5,…
$ playlist_appearances <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

To better understand how song popularity relates to playlist exposure, I constructed a track-level summary table that aggregates playlist information across the dataset. Each row in this table represents a unique track and artist combination, along with the song’s popularity score and the number of playlists in which it appears. Notably, the summary table has 14,987 rows, exactly as many as PLAYLIST_TABLE, which means every track appears in exactly one playlist regardless of its popularity score. Playlist inclusion in this dataset is therefore as sparse as it can be: no song is repeated across playlists, and playlist_appearances is constant at 1. The table is still the starting point for the visual and statistical checks that follow, but this lack of variation limits what they can reveal about whether more popular songs receive greater playlist exposure.
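A quick way to see how playlist appearances are distributed is to tabulate the counts. The sketch below does this in base R on a made-up playlist table (`playlist_demo`) in which one track deliberately sits in two playlists. The names and values are illustrative assumptions, not rows from the real dataset.

```r
# Toy playlist table (made-up rows): "song a" sits in two playlists.
playlist_demo <- data.frame(
  playlist_name = c("workout", "chill", "workout", "party"),
  track_name    = c("song a", "song a", "song b", "song c"),
  artist_name   = c("artist x", "artist x", "artist y", "artist z"),
  stringsAsFactors = FALSE
)

# Drop repeated playlist/track rows, then give each remaining row a weight of 1.
unique_rows <- unique(playlist_demo)
unique_rows$playlist_appearances <- 1L

# Sum the weights per track-artist pair to count playlist appearances.
counts_demo <- aggregate(
  playlist_appearances ~ track_name + artist_name,
  data = unique_rows,
  FUN  = sum
)

# Distribution of the counts: how many tracks appear once, twice, and so on.
appearance_table <- table(counts_demo$playlist_appearances)
print(appearance_table)

# If the track-level table has as many rows as the row-level table,
# every count must equal 1 and the variable carries no information.
no_repeats <- nrow(counts_demo) == nrow(unique_rows)
```

In the glimpse() outputs above, track_counts and PLAYLIST_TABLE both have 14,987 rows, which is exactly the situation the final check is meant to flag.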

Popularity vs Playlist Appearances

ggplot(track_counts, aes(x = popularity, y = playlist_appearances)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Popularity vs Playlist Appearances",
    x = "Track Popularity",
    y = "Number of Playlist Appearances"
  ) +
  theme_minimal(base_size = 13)

This scatter plot examines the relationship between track popularity and the number of playlists in which a track appears. Because track_counts has one row per original record, every track sits at a single playlist appearance regardless of popularity score, so the points form a horizontal band and the fitted line is flat. The plot therefore shows no upward trend, but it also cannot show one: with no variation in playlist appearances, this dataset cannot reveal whether more popular songs are placed in more playlists. The one-playlist-per-track pattern may reflect how the dataset was compiled rather than listener or curator behaviour, and answering the question would require data in which tracks recur across playlists.

Most danceable songs

SONGS %>%
  arrange(desc(danceability)) %>%
  select(track, artist, danceability, popularity, soundcloud_link) %>%
  slice_head(n = 5)
# A tibble: 5 × 5
  track                           artist danceability popularity soundcloud_link
  <chr>                           <chr>         <dbl>      <dbl> <chr>          
1 ice ice baby                    vanil…        1             70 http://soundcl…
2 cha cha slide - original live … dj ca…        0.999         54 http://soundcl…
3 funky friday                    dave          0.995         72 http://soundcl…
4 bad bad bad (feat. lil baby)    young…        0.994         81 http://soundcl…
5 cinnamon girl - radio edit      [dunk…        0.994         47 http://soundcl…

The table highlights the most danceable tracks in the dataset, with normalised danceability scores at or near the upper bound of 1. Some of these tracks, such as “ice ice baby” (popularity 70), “funky friday” (72), and “bad bad bad” (81), are also popular, while “cha cha slide” (54) and “cinnamon girl” (47) are less so despite their strong rhythmic character. Five rows are too few to establish a pattern, but the mix is consistent with the idea that high danceability does not by itself guarantee widespread success.

Danceability vs Popularity

ggplot(SONGS, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Danceability vs Popularity",
    x = "Danceability",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

This plot visualizes the relationship between danceability and popularity across all tracks. A weak positive trend is visible, indicating that tracks with higher danceability tend to be slightly more popular on average. However, the substantial dispersion of points shows that popularity varies widely at all danceability levels. This implies that while danceability may enhance a track’s appeal, it is only one of many factors influencing popularity.

Tempo vs Popularity

ggplot(SONGS, aes(x = tempo, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Tempo vs Popularity",
    x = "Tempo",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

The relationship between tempo and popularity appears weak: the LOESS-smoothed trend line is relatively flat, with no pronounced linear or non-linear pattern. Popularity remains fairly stable across a wide range of (normalised) tempo values, with no clear tempo range dominating popular songs. This indicates that tempo alone does not play a major role in determining a track’s popularity.

Statistical Analysis

# Correlation: Popularity vs Playlist Appearances
cor.test(track_counts$popularity, track_counts$playlist_appearances)

    Pearson's product-moment correlation

data:  track_counts$popularity and track_counts$playlist_appearances
t = NA, df = 14985, p-value = NA
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 NA NA
sample estimates:
cor 
 NA 
# Correlations with audio features
cor.test(SONGS$danceability, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$danceability and SONGS$popularity
t = 7.2175, df = 14985, p-value = 5.548e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.04288832 0.07479784
sample estimates:
       cor 
0.05885811 
cor.test(SONGS$energy, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$energy and SONGS$popularity
t = -11.128, df = 14985, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.10638503 -0.07462696
sample estimates:
        cor 
-0.09052901 
cor.test(SONGS$tempo, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$tempo and SONGS$popularity
t = 1.7274, df = 14985, p-value = 0.08411
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.001900565  0.030113487
sample estimates:
       cor 
0.01411008 
cor.test(SONGS$valence, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$valence and SONGS$popularity
t = -0.73058, df = 14985, p-value = 0.465
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02197611  0.01004317
sample estimates:
         cor 
-0.005968001 
# Linear regression
model <- lm(popularity ~ danceability + energy + tempo + valence, data = SONGS)
summary(model)

Call:
lm(formula = popularity ~ danceability + energy + tempo + valence, 
    data = SONGS)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.807 -16.590   4.981  18.849  55.951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   43.0421     1.2495  34.449  < 2e-16 ***
danceability   8.6653     1.2497   6.934 4.25e-12 ***
energy       -11.7020     1.1155 -10.491  < 2e-16 ***
tempo          6.0333     1.2975   4.650 3.35e-06 ***
valence       -0.7786     0.9470  -0.822    0.411    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 14982 degrees of freedom
Multiple R-squared:  0.01234,   Adjusted R-squared:  0.01207 
F-statistic: 46.78 on 4 and 14982 DF,  p-value: < 2.2e-16

Statistical analysis helps validate or refute patterns suggested by exploratory visuals. In this project, multiple statistical techniques were used to evaluate whether popularity is significantly associated with playlist exposure and whether musical characteristics meaningfully predict song success.

Pearson correlation tests were used to measure the linear relationship between each audio feature and the popularity score. Results showed that:

  • Danceability had a weak but statistically significant positive correlation (r ≈ 0.059, p < 0.001), indicating that more danceable songs tend to be slightly more popular.

  • Energy had a weak but statistically significant negative correlation (r ≈ −0.091, p < 0.001); one possible reading is that very intense songs are slightly less appealing to broad listener groups.

  • Tempo (r ≈ 0.014, p ≈ 0.084) and valence (r ≈ −0.006, p ≈ 0.465) showed near-zero correlations that are not statistically significant at the 5% level, suggesting no meaningful linear association.
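To make the link between a correlation coefficient and its test statistic concrete, the short sketch below recomputes the t value for danceability from the reported r and degrees of freedom, using the standard relation t = r·sqrt(df / (1 − r²)). The inputs are copied from the cor.test() output above, and the variable names are illustrative.

```r
# Values copied from the danceability cor.test() output.
r  <- 0.05885811
df <- 14985            # n - 2 for a Pearson correlation test

# t statistic implied by r and the degrees of freedom.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Share of the variance in popularity associated with danceability alone.
r_squared <- r^2

print(signif(c(t = t_stat, p = p_value, r_squared = r_squared), 4))
```

The recomputed t agrees with the reported 7.2175, while r² is only about 0.0035. The correlation is statistically detectable in a sample of this size, yet danceability on its own is associated with well under 1% of the variance in popularity.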

However, correlation alone does not test combined predictive effects. Therefore, a multiple linear regression model was fitted, using danceability, energy, tempo, and valence as predictors and popularity as the outcome. Danceability, energy, and tempo had statistically significant coefficients (tempo only once the other features were held constant), while valence did not. Even so, the model’s R² was approximately 0.012, meaning that only about 1.2% of the variation in popularity is explained by these four features.

This extremely low explanatory power means that, even when combined, these musical qualities account for almost none of the variation in popularity within this dataset. With roughly 15,000 observations, even tiny effects reach statistical significance, so the small p-values should not be mistaken for practical importance. If intrinsic musical structure were a dominant driver, R² would be expected to be substantially higher.
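As a rough arithmetic check on the regression summary, the sketch below recomputes the adjusted R² and the F-statistic from the reported R², sample size, and number of predictors, and then uses a small simulated example (not the real data) to show that R² is simply 1 − RSS/TSS. Because the reported R² is rounded to 0.01234, the recomputed values match the summary only approximately.

```r
# Reported values from the model summary.
r2 <- 0.01234   # Multiple R-squared (rounded in the printed output)
n  <- 14987     # number of observations
p  <- 4         # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed in terms of R-squared.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))

# Simulated illustration (made-up data): R-squared equals 1 - RSS / TSS.
set.seed(1)
x <- runif(200)
y <- 40 + 5 * x + rnorm(200, sd = 20)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
r2_manual <- 1 - rss / tss
```

Both recomputed quantities land close to the reported 0.01207 and 46.78, which illustrates that a highly significant F-test and a negligible share of explained variance can coexist when n is large.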

Next, playlist exposure was considered. Because every track in track_counts appears in exactly one playlist, playlist_appearances has zero variance, which makes the Pearson correlation mathematically undefined (hence the NA values in the output above). In other words, playlist count cannot explain popularity here because playlist count never varies. This is a limitation of the dataset rather than evidence about playlists: the data can neither confirm nor rule out an association between playlist presence and success.
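The undefined correlation can be reproduced on a tiny made-up example. Pearson's r divides the covariance by the product of the two standard deviations, so a variable that never changes puts a zero in the denominator. The vectors below are illustrative, and `safe_cor()` is a hypothetical guard, not a function from any package.

```r
# Made-up values: popularity varies, playlist appearances do not.
popularity  <- c(28, 0, 41, 65, 70)
appearances <- c(1L, 1L, 1L, 1L, 1L)

# A constant variable has a standard deviation of zero.
sd_appearances <- sd(appearances)

# cor() returns NA (with a warning) when either standard deviation is zero.
r_constant <- suppressWarnings(cor(popularity, appearances))

# Hypothetical guard: check for variation before attempting the correlation.
safe_cor <- function(x, y) {
  if (sd(x) == 0 || sd(y) == 0) {
    return(NA_real_)   # correlation is undefined without variation
  }
  cor(x, y)
}

r_guarded <- safe_cor(popularity, appearances)
r_valid   <- safe_cor(popularity, c(1, 2, 1, 3, 2))
```

Checking `sd()` or `table()` on a variable before calling cor.test() makes it clear up front whether a test is meaningful at all.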

Collectively, the statistical tests suggest that popularity is at most weakly related to the four audio features examined, while the role of playlist count could not be tested. Popularity is more plausibly driven by external mechanisms such as social media virality, artist influence, advertising campaigns, TikTok usage, and algorithmic boosts, none of which are represented in the dataset, so this remains a hypothesis rather than a finding.

Conclusion

This project set out to investigate whether playlist presence and musical characteristics meaningfully influence song popularity across music streaming platforms. Using a dataset containing nearly 15,000 SoundCloud-linked tracks enriched with Spotify-style audio features, the analysis incorporated exploratory visualization and statistical testing to evaluate how popularity is distributed and whether it aligns with common assumptions about what makes a song successful.

Across the exploratory visualizations, the results consistently pointed toward very weak relationships between musical characteristics and popularity. For example, the Danceability vs. Popularity scatterplot showed a cloud of data points with almost no visible trend, although a slight positive slope on the fitted line suggests that more danceable songs tend to have marginally higher popularity scores. However, the density and heavy overlap of observations indicate that danceability alone does not differentiate highly popular songs from those with very low scores; popular songs span the entire danceability range.

An energy distribution histogram restricted to tracks with popularity ≥ 70 indicated that moderately energetic music (values clustered between 0.5 and 0.8) appears most frequently among popular tracks, with extremely high-energy and low-energy songs both underrepresented. This hints that popular tracks favour a middle range of energy, although a histogram of popular tracks alone cannot separate that effect from the overall distribution of energy across all tracks.
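The code behind the energy histogram is not included in the report, so the following is only a sketch of how such a view might be built: filter to popular tracks using the threshold of 70 defined earlier, then bin energy into ten equal-width intervals. It runs on simulated data (`demo`), so the printed counts say nothing about the real dataset.

```r
# Simulated stand-in data; the real analysis would use SONGS instead.
set.seed(42)
demo <- data.frame(
  energy     = runif(500),                          # normalised 0-1 energy
  popularity = sample(0:100, 500, replace = TRUE)   # 0-100 popularity score
)

pop_threshold <- 70

# Keep only tracks at or above the popularity threshold.
popular <- demo[demo$popularity >= pop_threshold, ]

# Bin energy into ten equal-width intervals and count tracks per bin.
bins <- cut(
  popular$energy,
  breaks = seq(0, 1, by = 0.1),
  include.lowest = TRUE
)
energy_counts <- table(bins)
print(energy_counts)

# For comparison, the same binning applied to all tracks.
all_counts <- table(cut(
  demo$energy,
  breaks = seq(0, 1, by = 0.1),
  include.lowest = TRUE
))
```

Comparing `energy_counts` against `all_counts` bin by bin is what would show whether a middle range of energy is over-represented among popular tracks, rather than simply being the most common energy range overall.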

In contrast, the Tempo vs. Popularity scatterplot showed almost no relationship at all, evidenced visually by a nearly flat smoothed line. Popular songs can be slow, medium-tempo, or fast; this lack of structure strongly suggests that tempo does not meaningfully shape streaming success in this dataset.

Together, the visual patterns align with the statistical results, which show that the four audio features in the regression explain only about 1.2% of the variance in popularity, confirming that these intrinsic musical characteristics are not the dominant driver of success. Additionally, the playlist appearance data showed that every track appears in exactly one playlist regardless of popularity, so the dataset offers no way to test the idea that playlist exposure drives success through cumulative visibility.

Ultimately, this analysis concludes that song popularity is only weakly connected to the measured musical characteristics, and that its association with playlist count cannot be assessed with this dataset because playlist count does not vary. These findings are consistent with the view that digital music success is driven largely by external factors such as marketing, algorithmic placement, artist fanbase size, repost-chain virality, and social trends on platforms like TikTok. All of these fall outside the scope of this dataset and represent a valuable direction for future research.